PARQUET-446: Hide Thrift compiled headers and Boost from public API, #include scrubbing #49
wesm wants to merge 10 commits into apache:master
Conversation
…e of deserialized-related internal headers and code paths. Add unit test to enforce this
OK, I finished a bunch of include-what-you-use cleanup, so we can also resolve PARQUET-522 when this is merged. Note that the wall time for compiling
src/parquet/column/scanner.h (Outdated)

      protected:
    -   size_t batch_size_;
    +   int32_t batch_size_;
Can we keep this int64_t? Why should we limit the batch_size to 2GB for a byte type?
Will make int64_t -- data pages are limited to int32_t but a row group could in theory be larger.
I am a little concerned with having our own types for
@majetideepak Well, we have a hard constraint: exposing Thrift includes in the public API is not acceptable. What I've done is in line with parquet-mr's equivalent code (I didn't actually look until now: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/page/DataPageV1.java). From a software layering point of view it is better (IMHO) to decouple serialization/deserialization matters as much as possible. I'm happy, of course, to look at an alternate proposal that satisfies the constraint that developers using parquet-cpp as a third-party library do not need to include Thrift libraries and headers in their build system.
@julienledem can you review and comment? Thank you
I see your point with respect to Serializer and Deserializer and
      // assembled in a serialized stream for storing in a Parquet file

      SerializedPageReader::SerializedPageReader(std::unique_ptr<InputStream> stream,
          Compression::type codec) :
Should we name this ThriftSerializedPageReader since it is specific to Thrift? If we want to move along the lines of parquet-mr, I believe there should be further decoupling between Thrift and this Reader. We probably should have another module like parquet-cpp/parquet-thrift.
This could be done in a further downstream patch but I want to make sure we are going on the right path. @julienledem, @nongli and others, please pitch in your experience with parquet-mr here.
Alternatively, we could make parquet-cpp similar to parquet-mr/parquet-thrift (if I understood it correctly) with an additional ability to read and write "batch size" columns.
I don't agree with this line of reasoning. The Parquet specification uses Thrift for its metadata serialization, which is separate from code that can marshal materialized/reassembled records to and from Thrift or some other RPC/Serialization format (which is the role of parquet-thrift, parquet-avro, and other adapter components in parquet-mr). See for example: https://github.com/apache/parquet-mr/blob/master/parquet-thrift/src/main/java/org/apache/parquet/hadoop/thrift/ThriftToParquetFileWriter.java
parquet-thrift in parquet-mr is not related to reading parquet metadata (which happens to use thrift) in parquet-format
We use thrift in 2 places:
- read the parquet metadata in the file. We want that one encapsulated and invisible to the users of the library. This is why there's a separate object model in addition to the thrift IDL file. The thrift IDL is there to make sure we stay backwards compatible and have a strict spec; the other model is more convenient to work with and abstracts thrift out.
- parquet-thrift is the thrift integration to let users write parquet files using a thrift IDL to define their domain specific schema. It depends on the core parquet components. Not the other way around. It is at the same level as parquet-protobuf, parquet-avro, parquet-pig ...
In particular hiding the thrift version used for the metadata will make sure that it does not conflict with the version used by the end user for their own code generation.
I hope this helps.
@wesm and @julienledem thanks for the clarification!
+1 LGTM

+1 LGTM

thank you!
This is the completion of work I started in PARQUET-442. It also resolves PARQUET-277, since no Boost headers are included in the public API anymore.
I've done some scrubbing of #includes using Google's Clang-based include-what-you-use tool. PARQUET-522 can also be resolved when this is merged.